A modular open-source focused crawler for mining monolingual and bilingual corpora from the web

Authors

  • Vassilis Papavassiliou
  • Prokopis Prokopidis
  • Gregor Thurmair
Abstract

This paper discusses a modular, open-source focused crawler (ILSP-FC) for the automatic acquisition of domain-specific monolingual and bilingual corpora from the Web. Besides describing the main modules integrated in the crawler (page fetching, normalization, cleaning, text classification, de-duplication and document pair detection), we evaluate several of the system's functionalities in an experiment on acquiring pairs of parallel German and Italian documents in the "Health & Safety at work" domain.

Similar resources

: from Corpus Compilation to Bilingual Terminologies for MT and CAT Tools

This paper describes the TTC Web platform, an online demonstrator that shows the whole pipeline for compiling bilingual terminologies from comparable corpora gathered from the web, using the tools developed in the TTC project (Terminology Extraction, Translation Tools and Comparable Corpora). We present the whole chain that has been integrated into the platform, as well as its main components: a fo...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, downloading domain-specific web pages is not a simple task. An unfocused approach often yields undesired results. Several new ideas have therefore been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Domain-Specific Corpus Expansion with Focused Webcrawling

This work presents a straightforward method for extending or creating in-domain web corpora by focused webcrawling. The focused webcrawler uses statistical N-gram language models to estimate the relatedness of documents and weblinks and needs as input only N-grams or plain texts of a predefined domain and seed URLs as starting points. Two experiments demonstrate that our focused crawler is able...

A Particle Swarm Optimizer to Cluster Parallel Spanish-English Short-text Corpora / Un Optimizador basado en Cúmulo de Partículas para el Agrupamiento de Textos Cortos de Colecciones Paralelas en Español-Inglés

Short-text clustering is currently an important research area because of its applicability to web information retrieval, text summarization and text mining. These texts are often available in different languages and in parallel multilingual corpora. Some previous works have demonstrated the effectiveness of a discrete Particle Swarm Optimizer algorithm, named CLUDIPSO, for clustering monolingual ...

RIDIRE-CPI: an Open Source Crawling and Processing Infrastructure for Web Corpora Building

This paper introduces RIDIRE-CPI, an open source tool for building web corpora with a specific design through a targeted crawling strategy. The tool has been developed within the RIDIRE Project, which aims at creating a 2-billion-word balanced web corpus for Italian. The RIDIRE-CPI architecture integrates existing open source tools as well as modules developed specifically within the RID...

Publication date: 2013